This code covers chapter 7 of “Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach and Vipin Kumar. See table of contents for code examples for other chapters.
This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Here we use a small and very clean data set called Ruspini, which is included in the R package cluster.
data(ruspini, package = "cluster")
The Ruspini data set consists of 75 points in four groups and is popular for illustrating clustering techniques. It is a very simple data set with well-separated clusters. The original data set has the points ordered by group. We can shuffle the rows using sample_frac(), which by default samples 100% of the rows.
ruspini <- as_tibble(ruspini) %>% sample_frac()
ruspini
## # A tibble: 75 x 2
## x y
## <int> <int>
## 1 33 154
## 2 85 115
## 3 24 58
## 4 64 30
## 5 60 136
## 6 46 142
## 7 61 25
## 8 47 149
## 9 31 60
## 10 66 18
## # … with 65 more rows
ggplot(ruspini, aes(x = x, y = y)) + geom_point()
summary(ruspini)
## x y
## Min. : 4.00 Min. : 4.00
## 1st Qu.: 31.50 1st Qu.: 56.50
## Median : 52.00 Median : 96.00
## Mean : 54.88 Mean : 92.03
## 3rd Qu.: 76.50 3rd Qu.:141.50
## Max. :117.00 Max. :156.00
For most clustering algorithms, missing values and outliers have to be handled first (e.g., by removing the affected observations). For details see the section “Outlier removal” below. This data set has no missing values or strong outliers, and it looks like it has some very clear groups.
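A quick check for missing values (a minimal sketch; accessing the data directly via the cluster namespace so the shuffled copy above is not overwritten) confirms this:

```r
# Count missing values per column; the Ruspini data contains none
colSums(is.na(cluster::ruspini))
```
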
Clustering algorithms use distances, and the variables with the largest range will dominate the distance calculation. The summary above shows that this is not an issue for the Ruspini data set, with both x and y roughly between 0 and 150. Most data analysts will still scale each column in the data to zero mean and unit standard deviation (z-scores). Note: The standard scale() function scales a complete data matrix, so we implement a function that applies it to each numeric column individually.
# I use this till tidyverse implements a scale function
scale_numeric <- function(x) x %>% mutate_if(is.numeric, function(y) as.vector(scale(y)))
ruspini_scaled <- ruspini %>% scale_numeric()
summary(ruspini_scaled)
## x y
## Min. :-1.66806 Min. :-1.80743
## 1st Qu.:-0.76649 1st Qu.:-0.72946
## Median :-0.09442 Median : 0.08158
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.70879 3rd Qu.: 1.01582
## Max. : 2.03655 Max. : 1.31355
After scaling, most z-scores will fall in the range \([-3,3]\) (z-scores are measured in standard deviations from the mean), where \(0\) means average.
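To make the z-score concrete, here is a minimal sketch showing that scaling a single vector is just subtracting the mean and dividing by the standard deviation:

```r
# z-score by hand versus scale() on a single column
x <- cluster::ruspini$x
z <- (x - mean(x)) / sd(x)
all.equal(z, as.vector(scale(x)))  # TRUE
```
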
k-means implicitly assumes Euclidean distances. We use \(k = 4\) clusters and run the algorithm 10 times with randomly initialized centroids. The best result is returned.
km <- kmeans(ruspini_scaled, centers = 4, nstart = 10)
km
## K-means clustering with 4 clusters of sizes 17, 15, 20, 23
##
## Cluster means:
## x y
## 1 1.4194387 0.4692907
## 2 0.4607268 -1.4912271
## 3 -1.1385941 -0.5559591
## 4 -0.3595425 1.1091151
##
## Clustering vector:
## [1] 4 1 3 2 4 4 2 4 3 2 3 3 1 4 2 1 1 3 4 1 2 4 1 2 3 4 4 3 3 1 4 1 2 3 3 1 4 4
## [39] 3 3 1 4 3 3 2 4 4 3 3 1 4 2 1 1 1 2 4 4 4 1 4 3 3 4 3 4 1 2 3 2 2 1 2 2 4
##
## Within cluster sum of squares by cluster:
## [1] 3.641276 1.082373 2.705477 2.658679
## (between_SS / total_SS = 93.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
km is an R object implemented as a list. The clustering vector contains the cluster assignment for each data row and can be accessed using km$cluster. I add the cluster assignment as a column to the scaled dataset (I make it a factor since it represents a nominal label).
ruspini_clustered <- ruspini_scaled %>% add_column(cluster = factor(km$cluster))
ruspini_clustered
## # A tibble: 75 x 3
## x y cluster
## <dbl> <dbl> <fct>
## 1 -0.717 1.27 4
## 2 0.987 0.472 1
## 3 -1.01 -0.699 3
## 4 0.299 -1.27 2
## 5 0.168 0.903 4
## 6 -0.291 1.03 4
## 7 0.201 -1.38 2
## 8 -0.258 1.17 4
## 9 -0.783 -0.658 3
## 10 0.365 -1.52 2
## # … with 65 more rows
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + geom_point()
Add the centroids to the plot.
centroids <- as_tibble(km$centers, rownames = "cluster")
centroids
## # A tibble: 4 x 3
## cluster x y
## <chr> <dbl> <dbl>
## 1 1 1.42 0.469
## 2 2 0.461 -1.49
## 3 3 -1.14 -0.556
## 4 4 -0.360 1.11
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + geom_point() +
geom_point(data = centroids, aes(x = x, y = y, color = cluster), shape = 3, size = 10)
Use the factoextra package for visualization
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(km, data = ruspini_scaled, centroids = TRUE, repel = TRUE, ellipse.type = "norm")
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We inspect the clusters created by the 4-cluster k-means solution. The following code can be adapted to be used for other clustering methods.
Inspect the centroids with horizontal bar charts organized by cluster. To group the plots by cluster, we have to change the data format to the “long”-format using a pivot operation.
ggplot(pivot_longer(centroids, cols = c(x, y), names_to = "feature"),
aes(x = value, y = feature)) +
geom_bar(stat = "identity") +
facet_grid(rows = vars(cluster))
To inspect individual clusters, you only need to filter the rows corresponding to the cluster index. The next example calculates summary statistics and then plots all data points of cluster 1.
cluster1 <- ruspini_clustered %>% filter(cluster == 1)
cluster1
## # A tibble: 17 x 3
## x y cluster
## <dbl> <dbl> <fct>
## 1 0.987 0.472 1
## 2 1.81 0.390 1
## 3 1.38 0.615 1
## 4 1.84 0.698 1
## 5 1.02 0.821 1
## 6 1.97 0.513 1
## 7 1.74 0.492 1
## 8 1.45 0.739 1
## 9 1.45 0.554 1
## 10 2.04 0.472 1
## 11 1.41 0.492 1
## 12 0.627 0.0816 1
## 13 0.758 0.0405 1
## 14 1.51 0.472 1
## 15 1.74 0.390 1
## 16 1.41 0.657 1
## 17 0.987 0.0816 1
summary(cluster1)
## x y cluster
## Min. :0.6268 Min. :0.04052 1:17
## 1st Qu.:1.0202 1st Qu.:0.38958 2: 0
## Median :1.4464 Median :0.49224 3: 0
## Mean :1.4194 Mean :0.46929 4: 0
## 3rd Qu.:1.7415 3rd Qu.:0.61544
## Max. :2.0366 Max. :0.82076
ggplot(cluster1, aes(x = x, y = y)) + geom_point() +
coord_cartesian(xlim = c(-2, 2), ylim = c(-2, 2))
What happens if we try to cluster with 8 centers?
fviz_cluster(kmeans(ruspini_scaled, centers = 8), data = ruspini_scaled,
centroids = TRUE, geom = "point", ellipse.type = "norm")
## Too few points to calculate an ellipse
dist() defaults to method = "euclidean".
d <- dist(ruspini_scaled)
We cluster using complete link
hc <- hclust(d, method = "complete")
Hierarchical clustering does not return cluster assignments but a dendrogram. The standard plot function plots the dendrogram.
plot(hc)
Use factoextra (ggplot version). We can specify the number of clusters to visualize how the dendrogram will be cut into clusters.
fviz_dend(hc, k = 4)
More plotting options for dendrograms, including plotting parts of large dendrograms can be found here.
Extract cluster assignments by cutting the dendrogram into four parts and add the cluster id to the data.
clusters <- cutree(hc, k = 4)
cluster_complete <- ruspini_scaled %>%
add_column(cluster = factor(clusters))
cluster_complete
## # A tibble: 75 x 3
## x y cluster
## <dbl> <dbl> <fct>
## 1 -0.717 1.27 1
## 2 0.987 0.472 2
## 3 -1.01 -0.699 3
## 4 0.299 -1.27 4
## 5 0.168 0.903 1
## 6 -0.291 1.03 1
## 7 0.201 -1.38 4
## 8 -0.258 1.17 1
## 9 -0.783 -0.658 3
## 10 0.365 -1.52 4
## # … with 65 more rows
ggplot(cluster_complete, aes(x, y, color = cluster)) +
geom_point()
Try 8 clusters. (Note: for hclust results, fviz_cluster needs a list with the data and the cluster labels.)
fviz_cluster(list(data = ruspini_scaled, cluster = cutree(hc, k = 8)), geom = "point")
Clustering with single link
hc_single <- hclust(d, method = "single")
fviz_dend(hc_single, k = 4)
fviz_cluster(list(data = ruspini_scaled, cluster = cutree(hc_single, k = 4)), geom = "point")
library(dbscan)
Parameters: minPts defines how many points are needed in the epsilon neighborhood to make a point a core point; it often acts as a smoothing parameter. I use minPts = 4 here.
To decide on epsilon, the knee in the kNN distance plot is often used. Note that minPts contains the point itself, while the k-nearest neighbor does not. We therefore have to use k = minPts - 1! The knee is around eps = .32.
kNNdistplot(ruspini_scaled, k = 3)
abline(h = .32, col = "red")
Run dbscan.
db <- dbscan(ruspini_scaled, eps = .32, minPts = 4)
db
## DBSCAN clustering for 75 objects.
## Parameters: eps = 0.32, minPts = 4
## The clustering contains 4 cluster(s) and 5 noise points.
##
## 0 1 2 3 4
## 5 23 20 15 12
##
## Available fields: cluster, eps, minPts
str(db)
## List of 3
## $ cluster: int [1:75] 1 0 2 3 1 1 3 1 2 3 ...
## $ eps : num 0.32
## $ minPts : num 4
## - attr(*, "class")= chr [1:2] "dbscan_fast" "dbscan"
ggplot(ruspini_scaled %>% add_column(cluster = factor(db$cluster)),
aes(x, y, color = cluster)) + geom_point()
Note: Cluster 0 represents outliers (noise points).
fviz_cluster(db, ruspini_scaled, geom = "point")
Play with eps (the neighborhood size) and minPts (the minimum number of points needed to form a core point).
Also called \(k\)-medoids. Similar to \(k\)-means, but it uses medoids instead of centroids to represent clusters and works on a precomputed distance matrix. An advantage is that you can use any distance metric, not just Euclidean distance. Note: The medoid is the most central data point of a cluster.
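To make the medoid idea concrete, here is a minimal sketch on toy 1D data (not the Ruspini set): the medoid is the data point with the smallest total distance to all other points, and unlike the centroid it is always an actual data point and robust to the outlier:

```r
pts <- c(1, 2, 3, 4, 100)             # toy data with one outlier
m <- as.matrix(dist(pts))             # pairwise distance matrix
medoid <- pts[which.min(rowSums(m))]  # point minimizing total distance
medoid                                # 3, an actual data point
mean(pts)                             # 22, the centroid is pulled toward the outlier
```
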
library(cluster)
##
## Attaching package: 'cluster'
## The following object is masked _by_ '.GlobalEnv':
##
## ruspini
d <- dist(ruspini_scaled)
str(d)
## 'dist' num [1:2775] 1.883 1.993 2.741 0.959 0.492 ...
## - attr(*, "Size")= int 75
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = ruspini_scaled)
p <- pam(d, k = 4)
p
## Medoids:
## ID
## [1,] 19 19
## [2,] 36 36
## [3,] 48 48
## [4,] 24 24
## Clustering vector:
## [1] 1 2 3 4 1 1 4 1 3 4 3 3 2 1 4 2 2 3 1 2 4 1 2 4 3 1 1 3 3 2 1 2 4 3 3 2 1 1
## [39] 3 3 2 1 3 3 4 1 1 3 3 2 1 4 2 2 2 4 1 1 1 2 1 3 3 1 3 1 2 4 3 4 4 2 4 4 1
## Objective function:
## build swap
## 0.4422977 0.3187056
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call"
ruspini_clustered <- ruspini_scaled %>% add_column(cluster = factor(p$cluster))
medoids <- as_tibble(ruspini_scaled[p$medoids, ], rownames = "cluster")
medoids
## # A tibble: 4 x 3
## cluster x y
## <chr> <dbl> <dbl>
## 1 1 -0.357 1.17
## 2 2 1.45 0.554
## 3 3 -1.18 -0.555
## 4 4 0.463 -1.46
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + geom_point() +
geom_point(data = medoids, aes(x = x, y = y, color = cluster), shape = 3, size = 10)
# __Note:__ `fviz_cluster` needs the original data.
fviz_cluster(c(p, list(data = ruspini_scaled)), geom = "point", ellipse.type = "norm")
library(mclust)
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
##
## map
Mclust uses Bayesian Information Criterion (BIC) to find the number of clusters (model selection). BIC uses the likelihood and a penalty term to guard against overfitting.
m <- Mclust(ruspini_scaled)
summary(m)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 5 components:
##
## log-likelihood n df BIC ICL
## -91.26485 75 16 -251.6095 -251.7486
##
## Clustering table:
## 1 2 3 4 5
## 23 3 20 15 14
plot(m, what = "classification")
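The reported BIC can be retraced from the numbers in the summary above. mclust uses the convention BIC = 2 * log-likelihood - df * log(n), so larger values are better:

```r
loglik <- -91.26485       # log-likelihood from summary(m) above
df <- 16                  # number of estimated parameters
n <- 75                   # number of observations
2 * loglik - df * log(n)  # -251.6095, the BIC reported above
```
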
Rerun with a fixed number of 4 clusters
m <- Mclust(ruspini_scaled, G=4)
summary(m)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## -101.6027 75 13 -259.3327 -259.3356
##
## Clustering table:
## 1 2 3 4
## 23 17 20 15
plot(m, what = "classification")
Spectral clustering works by embedding the data points into the subspace spanned by the k largest eigenvectors of a normalized affinity/kernel matrix and then applying a simple clustering method like k-means in this embedding.
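The embedding idea can be sketched on toy data (a minimal illustration with an assumed kernel width of 1; specc below estimates the kernel parameter automatically, and nothing in this sketch is part of kernlab):

```r
# Toy sketch: embed points with the top eigenvectors of a normalized
# RBF affinity matrix, then cluster the embedding with k-means
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
A <- exp(-as.matrix(dist(x))^2)       # RBF affinity, sigma assumed to be 1
D <- diag(1 / sqrt(rowSums(A)))
L <- D %*% A %*% D                    # normalized affinity matrix
V <- eigen(L)$vectors[, 1:2]          # top-2 eigenvectors as new features
kmeans(V, centers = 2)$cluster        # recovers the two groups
```
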
library("kernlab")
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
cluster_spec <- specc(as.matrix(ruspini_scaled), centers = 4)
cluster_spec
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 1 3 4 2 1 1 2 1 4 2 4 4 3 1 2 3 3 4 1 3 2 1 3 2 4 1 1 4 4 3 1 3 2 4 4 3 1 1 4 4 3 1 4 4 2 1 1 4 4 3 1 2 3 3 3 2 1 1 1 3 1 4 4 1 4 1 3 2 4 2 2 3 2 2 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 41.7670067458421
##
## Centers:
## [,1] [,2]
## [1,] -0.3595425 1.1091151
## [2,] 0.4607268 -1.4912271
## [3,] 1.4194387 0.4692907
## [4,] -1.1385941 -0.5559591
##
## Cluster size:
## [1] 23 15 17 20
##
## Within-cluster sum of squares:
## [1] 49.061745 57.268006 20.047164 7.929369
ggplot(ruspini_scaled %>% add_column(cluster = factor(cluster_spec)),
aes(x, y, color = cluster)) + geom_point()
The fuzzy version of the well-known k-means clustering algorithm. Each data point has a degree of membership for each cluster.
library("e1071")
cluster_cmeans <- cmeans(as.matrix(ruspini_scaled), centers = 4)
cluster_cmeans
## Fuzzy c-means clustering with 4 clusters
##
## Cluster centers:
## x y
## 1 0.4552396 -1.4760174
## 2 1.5047495 0.5160655
## 3 -1.1371362 -0.5549795
## 4 -0.3763191 1.1143346
##
## Memberships:
## 1 2 3 4
## [1,] 0.0146290735 2.370801e-02 0.0371530077 9.245099e-01
## [2,] 0.0536184174 8.109455e-01 0.0392594423 9.617660e-02
## [3,] 0.0127759825 4.511103e-03 0.9731671576 9.545757e-03
## [4,] 0.9523625551 1.337417e-02 0.0241500166 1.011326e-02
## [5,] 0.0448216677 1.328692e-01 0.0672233016 7.550858e-01
## [6,] 0.0021823678 4.268953e-03 0.0046270934 9.889216e-01
## [7,] 0.9470190151 1.340794e-02 0.0287382847 1.083477e-02
## [8,] 0.0022375111 4.752125e-03 0.0044841069 9.885263e-01
## [9,] 0.0549640356 1.831454e-02 0.8900885379 3.663288e-02
## [10,] 0.9936328626 1.852953e-03 0.0031667503 1.347434e-03
## [11,] 0.0138086834 6.485655e-03 0.9626082111 1.709745e-02
## [12,] 0.0971239314 4.137713e-02 0.7683800897 9.311885e-02
## [13,] 0.0192372091 9.507893e-01 0.0106802506 1.929329e-02
## [14,] 0.0139399081 3.404968e-02 0.0247843290 9.272261e-01
## [15,] 0.9503614981 1.817958e-02 0.0205929588 1.086596e-02
## [16,] 0.0047471442 9.845898e-01 0.0032206790 7.442329e-03
## [17,] 0.0205451062 9.395400e-01 0.0130810618 2.683385e-02
## [18,] 0.0633451086 2.129943e-02 0.8727576482 4.259781e-02
## [19,] 0.0004511883 8.878840e-04 0.0009642677 9.976967e-01
## [20,] 0.0461271984 7.877459e-01 0.0394161051 1.267108e-01
## [21,] 0.9839903294 4.422491e-03 0.0082407141 3.346465e-03
## [22,] 0.0133133010 1.972778e-02 0.0376047279 9.293542e-01
## [23,] 0.0318397200 9.158212e-01 0.0184330517 3.390601e-02
## [24,] 0.9997656129 7.425081e-05 0.0001096041 5.053214e-05
## [25,] 0.0336331346 1.797579e-02 0.8953157712 5.307530e-02
## [26,] 0.0048166974 8.407785e-03 0.0114697255 9.753058e-01
## [27,] 0.0059279865 1.353920e-02 0.0112520579 9.692808e-01
## [28,] 0.0593827892 2.497431e-02 0.8604286722 5.521422e-02
## [29,] 0.0102385085 4.907486e-03 0.9712822552 1.357175e-02
## [30,] 0.0099628779 9.728621e-01 0.0058700369 1.130502e-02
## [31,] 0.0047497126 7.760547e-03 0.0120644622 9.754253e-01
## [32,] 0.0087247275 9.702930e-01 0.0061526118 1.482961e-02
## [33,] 0.9591003116 1.454359e-02 0.0173388073 9.017287e-03
## [34,] 0.0155195481 7.843921e-03 0.9540755069 2.256102e-02
## [35,] 0.0740343148 2.108363e-02 0.8665089076 3.837315e-02
## [36,] 0.0009432377 9.971243e-01 0.0006089244 1.323528e-03
## [37,] 0.0046347730 9.539690e-03 0.0095407338 9.762848e-01
## [38,] 0.0078904149 1.728034e-02 0.0151637759 9.596655e-01
## [39,] 0.0220728697 9.896871e-03 0.9426476422 2.538262e-02
## [40,] 0.0272601309 1.152720e-02 0.9343457415 2.686692e-02
## [41,] 0.0405236630 8.956434e-01 0.0229241744 4.090881e-02
## [42,] 0.0256677738 3.625362e-02 0.0758500020 8.622286e-01
## [43,] 0.0405048583 1.461004e-02 0.9148706567 3.001445e-02
## [44,] 0.0034024778 1.359415e-03 0.9920953538 3.142753e-03
## [45,] 0.9492340337 1.926396e-02 0.0203873422 1.111466e-02
## [46,] 0.0014971882 2.766855e-03 0.0033846275 9.923513e-01
## [47,] 0.0041347057 6.877953e-03 0.0103260940 9.786612e-01
## [48,] 0.0004363843 1.837834e-04 0.9989327639 4.470684e-04
## [49,] 0.0339788387 1.644881e-02 0.9045021176 4.507024e-02
## [50,] 0.0018398943 9.945448e-01 0.0011597574 2.455542e-03
## [51,] 0.0155185355 2.380060e-02 0.0419821512 9.186987e-01
## [52,] 0.9537433681 1.775437e-02 0.0183427182 1.015955e-02
## [53,] 0.1837549247 4.702624e-01 0.1283047358 2.176779e-01
## [54,] 0.1773864764 5.412231e-01 0.1075062014 1.738842e-01
## [55,] 0.0004109719 9.988318e-01 0.0002500025 5.072029e-04
## [56,] 0.9921767781 2.298183e-03 0.0038610595 1.663980e-03
## [57,] 0.0199937127 5.510138e-02 0.0333624447 8.915425e-01
## [58,] 0.0519056794 1.046579e-01 0.0920672640 7.513692e-01
## [59,] 0.0114701041 2.852451e-02 0.0204613990 9.395440e-01
## [60,] 0.0135401379 9.650085e-01 0.0075748303 1.387652e-02
## [61,] 0.0103550074 1.728603e-02 0.0256382639 9.467207e-01
## [62,] 0.0358203010 1.143260e-02 0.9302636438 2.248345e-02
## [63,] 0.0290862363 1.010342e-02 0.9397670272 2.104332e-02
## [64,] 0.0509711232 1.760213e-01 0.0713880019 7.016195e-01
## [65,] 0.0431080201 1.998634e-02 0.8877333742 4.917227e-02
## [66,] 0.0007051347 1.321817e-03 0.0015601047 9.964129e-01
## [67,] 0.0050412521 9.834305e-01 0.0034556962 8.072521e-03
## [68,] 0.9472523729 1.645554e-02 0.0249345610 1.135752e-02
## [69,] 0.0452757332 2.771401e-02 0.8288386580 9.817160e-02
## [70,] 0.8921797458 4.590736e-02 0.0384835253 2.342937e-02
## [71,] 0.9125433338 2.037900e-02 0.0498112622 1.726641e-02
## [72,] 0.1188567952 7.056387e-01 0.0654645927 1.100399e-01
## [73,] 0.9934966405 2.028556e-03 0.0030790004 1.395804e-03
## [74,] 0.9531436928 1.186793e-02 0.0254625689 9.525809e-03
## [75,] 0.0168051985 2.448057e-02 0.0483839330 9.103303e-01
##
## Closest hard clustering:
## [1] 4 2 3 1 4 4 1 4 3 1 3 3 2 4 1 2 2 3 4 2 1 4 2 1 3 4 4 3 3 2 4 2 1 3 3 2 4 4
## [39] 3 3 2 4 3 3 1 4 4 3 3 2 4 1 2 2 2 1 4 4 4 2 4 3 3 4 3 4 2 1 3 1 1 2 1 1 4
##
## Available components:
## [1] "centers" "size" "cluster" "membership" "iter"
## [6] "withinerror" "call"
Plot membership (shown as small pie charts)
library("scatterpie")
ggplot() +
geom_scatterpie(data = cbind(ruspini_scaled, cluster_cmeans$membership),
aes(x = x, y = y), cols = colnames(cluster_cmeans$membership), legend_name = "Membership") + coord_equal()
Look at the within.cluster.ss and the avg.silwidth
#library(fpc)
Notes:

* I do not load fpc since its NAMESPACE would overwrite functions in dbscan.
* The clustering (second argument below) has to be supplied as a vector of integer cluster IDs and cannot be a factor (use as.integer() to convert a factor into IDs).
fpc::cluster.stats(d, km$cluster)
## $n
## [1] 75
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 17 15 20 23
##
## $min.cluster.size
## [1] 15
##
## $noisen
## [1] 0
##
## $diameter
## [1] 1.4627043 0.8359025 1.1192822 1.1591436
##
## $average.distance
## [1] 0.5805551 0.3564457 0.4824376 0.4286139
##
## $median.distance
## [1] 0.5023855 0.3379729 0.4492432 0.3934100
##
## $separation
## [1] 0.767612 1.157682 1.157682 0.767612
##
## $average.toother
## [1] 2.293318 2.307969 2.157193 2.148527
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 1.308435 1.339721 0.767612
## [2,] 1.308435 0.000000 1.157682 1.957726
## [3,] 1.339721 1.157682 0.000000 1.219930
## [4,] 0.767612 1.957726 1.219930 0.000000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 2.220011 2.771960 1.924915
## [2,] 2.220011 0.000000 1.874198 2.750174
## [3,] 2.771960 1.874198 0.000000 1.887363
## [4,] 1.924915 2.750174 1.887363 0.000000
##
## $average.between
## [1] 2.219257
##
## $average.within
## [1] 0.4629733
##
## $n.between
## [1] 2091
##
## $n.within
## [1] 684
##
## $max.diameter
## [1] 1.462704
##
## $min.separation
## [1] 0.767612
##
## $within.cluster.ss
## [1] 10.08781
##
## $clus.avg.silwidths
## 1 2 3 4
## 0.6812849 0.8073733 0.7211353 0.7454551
##
## $avg.silwidth
## [1] 0.7368082
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.8415597
##
## $dunn
## [1] 0.5247896
##
## $dunn2
## [1] 3.228286
##
## $entropy
## [1] 1.37327
##
## $wb.ratio
## [1] 0.2086163
##
## $ch
## [1] 323.5512
##
## $cwidegap
## [1] 0.4149825 0.2351854 0.2611701 0.3152817
##
## $widestgap
## [1] 0.4149825
##
## $sindex
## [1] 0.8583457
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Read ? cluster.stats for an explanation of all the available indices.
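As a sanity check, the Dunn index printed above is indeed the minimum separation divided by the maximum diameter:

```r
min_separation <- 0.767612     # $min.separation from the output above
max_diameter <- 1.462704       # $max.diameter from the output above
min_separation / max_diameter  # approx. 0.5248, matching $dunn
```
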
sapply(
list(
km = km$cluster,
hc_compl = cutree(hc, k = 4),
hc_single = cutree(hc_single, k = 4)
),
FUN = function(x)
fpc::cluster.stats(d, x))[c("within.cluster.ss", "avg.silwidth"), ]
## km hc_compl hc_single
## within.cluster.ss 10.08781 10.08781 10.08781
## avg.silwidth 0.7368082 0.7368082 0.7368082
library(cluster)
plot(silhouette(km$cluster, d))
Note: The silhouette plot does not render correctly in RStudio if you have too many objects (bars are missing). It will work when you open a new plotting device with windows(), x11() or quartz(). ggplot versions of the plot are also available.
ggplot visualization using factoextra
fviz_silhouette(silhouette(km$cluster, d))
## cluster size ave.sil.width
## 1 1 17 0.68
## 2 2 15 0.81
## 3 3 20 0.72
## 4 4 23 0.75
ggplot(ruspini_scaled, aes(x, y)) + geom_point()
# We will use different methods and try 2 to 10 clusters.
set.seed(1234)
ks <- 2:10
Use the within-cluster sum of squares and look for the knee (nstart = 5 repeats k-means 5 times and returns the best solution).
WSS <- sapply(ks, FUN = function(k) {
kmeans(ruspini_scaled, centers = k, nstart = 5)$tot.withinss
})
ggplot(tibble(ks, WSS), aes(ks, WSS)) + geom_line() +
  geom_vline(xintercept = 4, color = "red", linetype = 2)
Use average silhouette width (look for the max)
ASW <- sapply(ks, FUN=function(k) {
fpc::cluster.stats(d, kmeans(ruspini_scaled, centers=k, nstart = 5)$cluster)$avg.silwidth
})
best_k <- ks[which.max(ASW)]
best_k
## [1] 4
ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)
Use the Dunn index (another internal measure, defined as min. separation / max. diameter).
DI <- sapply(ks, FUN=function(k) {
fpc::cluster.stats(d, kmeans(ruspini_scaled, centers=k, nstart=5)$cluster)$dunn
})
best_k <- ks[which.max(DI)]
ggplot(tibble(ks, DI), aes(ks, DI)) + geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)
Compares the change in within-cluster dispersion with that expected from a null model (see ? clusGap). The default method is to choose the smallest k such that its value Gap(k) is not more than 1 standard error away from the first local maximum.
library(cluster)
k <- clusGap(ruspini_scaled, FUN = kmeans, nstart = 10, K.max = 10)
k
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = ruspini_scaled, FUNcluster = kmeans, K.max = 10, nstart = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
## logW E.logW gap SE.sim
## [1,] 3.497911 3.467106 -0.03080497 0.03570312
## [2,] 3.073937 3.150184 0.07624761 0.03743636
## [3,] 2.678004 2.902669 0.22466432 0.03796393
## [4,] 2.106416 2.703517 0.59710065 0.03633061
## [5,] 1.987213 2.569927 0.58271427 0.03474390
## [6,] 1.863936 2.451066 0.58712975 0.03650821
## [7,] 1.732234 2.347809 0.61557504 0.03954281
## [8,] 1.657538 2.256196 0.59865860 0.04131903
## [9,] 1.579299 2.172143 0.59284444 0.04090234
## [10,] 1.493402 2.092989 0.59958708 0.03933863
plot(k)
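The 'firstSEmax' rule can be retraced from the table above (a minimal sketch with the printed values): the first local maximum of the gap is at k = 4, and k = 4 is also the smallest k whose gap comes within one standard error of that maximum:

```r
# Gap(1..5) and SE.sim at the first local maximum (k = 4), from the output above
gap <- c(-0.03080497, 0.07624761, 0.22466432, 0.59710065, 0.58271427)
se4 <- 0.03633061
which(gap >= gap[4] - se4)[1]  # 4, the number of clusters chosen above
```
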
Note: these methods can also be used for hierarchical clustering.
There have been many other methods and indices proposed to determine the number of clusters. See, e.g., package NbClust.
ggplot(ruspini_scaled, aes(x, y, color = factor(km$cluster))) + geom_point()
d <- dist(ruspini_scaled)
Inspect the distance matrix between the first 5 objects.
as.matrix(d)[1:5, 1:5]
## 1 2 3 4 5
## 1 0.0000000 1.8834833 1.993107 2.741410 0.9592325
## 2 1.8834833 0.0000000 2.317131 1.876169 0.9261071
## 3 1.9931075 2.3171315 0.000000 1.431856 1.9894537
## 4 2.7414097 1.8761691 1.431856 0.000000 2.1804213
## 5 0.9592325 0.9261071 1.989454 2.180421 0.0000000
A false-color image visualizes each value in the matrix as a pixel with the color representing the value.
library(seriation)
pimage(d)
Rows and columns are the objects as they are ordered in the data set. The diagonal represents the distance between an object and itself and has by definition a distance of 0 (dark line). Visualizing the unordered distance matrix does not show much structure, but we can reorder the matrix (rows and columns) using the k-means cluster labels from cluster 1 to 4. A clear block structure representing the clusters becomes visible.
pimage(d, order=order(km$cluster))
The function dissplot() in package seriation rearranges the matrix and adds lines and cluster labels. In the lower half of the plot, it shows average dissimilarities between clusters. The function organizes the objects by cluster and then reorders clusters and objects within clusters so that more similar objects are closer together.
dissplot(d, labels = km$cluster, options=list(main="k-means with k=4"))
The reordering by dissplot makes the misspecification of k visible as blocks.
dissplot(d, labels = kmeans(ruspini_scaled, centers = 3)$cluster)
dissplot(d, labels = kmeans(ruspini_scaled, centers = 9)$cluster)
Using factoextra
fviz_dist(d)
External cluster validation uses ground truth information. That is, the user has an idea how the data should be grouped. This could be a known class label not provided to the clustering algorithm.
We use an artificial data set with known groups.
library(mlbench)
set.seed(1234)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)
plot(shapes)
Prepare data
truth <- as.integer(shapes$class)
shapes <- scale(shapes$x)
colnames(shapes) <- c("x", "y")
shapes <- as_tibble(shapes)
ggplot(shapes, aes(x, y)) + geom_point()
Find the optimal number of clusters for k-means.
ks <- 2:20
Use within sum of squares (look for the knee)
WSS <- sapply(ks, FUN = function(k) {
kmeans(shapes, centers = k, nstart = 10)$tot.withinss
})
ggplot(tibble(ks, WSS), aes(ks, WSS)) + geom_line()
Looks like it could be 7 clusters
km <- kmeans(shapes, centers = 7, nstart = 10)
ggplot(shapes %>% add_column(cluster = factor(km$cluster)), aes(x, y, color = cluster)) +
geom_point()
Hierarchical clustering: We use single link because the mouth is non-convex and chaining may help.
d <- dist(shapes)
hc <- hclust(d, method = "single")
Find optimal number of clusters
ASW <- sapply(ks, FUN = function(k) {
fpc::cluster.stats(d, cutree(hc, k))$avg.silwidth
})
ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line()
The maximum is clearly at 4 clusters.
hc_4 <- cutree(hc, 4)
ggplot(shapes %>% add_column(cluster = factor(hc_4)), aes(x, y, color = cluster)) +
geom_point()
Compare the clusterings to the ground truth using the corrected (= adjusted) Rand index (ARI), the variation of information (VI) index, entropy and purity.
cluster.stats() computes ARI and VI as comparative measures. I define functions for entropy and purity here:
entropy <- function(cluster, truth) {
  k <- max(cluster, truth)
  cluster <- factor(cluster, levels = 1:k)
  truth <- factor(truth, levels = 1:k)
  w <- table(cluster)/length(cluster)
  # contingency table with one row per cluster and one column per true class
  cnts <- t(sapply(split(truth, cluster), table))
  # class proportions within each cluster (empty clusters produce NaN)
  p <- sweep(cnts, 1, rowSums(cnts), "/")
  p[is.nan(p)] <- 0
  e <- -p * log(p, 2)
  sum(w * rowSums(e, na.rm = TRUE))
}
purity <- function(cluster, truth) {
  k <- max(cluster, truth)
  cluster <- factor(cluster, levels = 1:k)
  truth <- factor(truth, levels = 1:k)
  w <- table(cluster)/length(cluster)
  # contingency table with one row per cluster and one column per true class
  cnts <- t(sapply(split(truth, cluster), table))
  p <- sweep(cnts, 1, rowSums(cnts), "/")
  p[is.nan(p)] <- 0
  # purity: size-weighted share of the majority class in each cluster
  sum(w * apply(p, 1, max))
}
Calculate the measures (for comparison, we also use random “clusterings” with 4 and 6 clusters).
random_4 <- sample(1:4, nrow(shapes), replace = TRUE)
random_6 <- sample(1:6, nrow(shapes), replace = TRUE)
r <- rbind(
kmeans_7 = c(
unlist(fpc::cluster.stats(d, km$cluster, truth, compareonly = TRUE)),
entropy = entropy(km$cluster, truth),
purity = purity(km$cluster, truth)
),
hc_4 = c(
unlist(fpc::cluster.stats(d, hc_4, truth, compareonly = TRUE)),
entropy = entropy(hc_4, truth),
purity = purity(hc_4, truth)
),
random_4 = c(
unlist(fpc::cluster.stats(d, random_4, truth, compareonly = TRUE)),
entropy = entropy(random_4, truth),
purity = purity(random_4, truth)
),
random_6 = c(
unlist(fpc::cluster.stats(d, random_6, truth, compareonly = TRUE)),
entropy = entropy(random_6, truth),
purity = purity(random_6, truth)
)
)
r
## corrected.rand vi entropy purity
## kmeans_7 0.638228776 0.5708767 0.2285591 0.4638565
## hc_4 1.000000000 0.0000000 0.0000000 1.0000000
## random_4 -0.003235188 2.6831737 1.9883398 0.2878286
## random_6 -0.002125389 3.0762751 1.7280983 0.1436265
Note: Read ? cluster.stats for an explanation of all the available indices.